1 Introduction

My GitHub repo is here.

You can find my data wrangling script here.


2 Preparatory actions

Before we can start analysing the student data, some groundwork is needed: preparing the environment for the analysis and importing the data itself.

Much of the data preparation needed to make the data amenable to MCA has already been done in the data wrangling script.

2.1 Library import

We’ll start off by importing the libraries that provide many of the functions we’ll use in this assignment. Because the libraries merely provide the functions, which we’ll describe when they’re used in the context of the data, I’ve included short descriptions of the libraries as code comments below.

library(FactoMineR) # This includes a lot of factor analysis stuff, including MCA
library(factoextra) # Consider it, if you will, an extension to FactoMineR with more advanced plotting capabilities
library(ggplot2) # The do-all plotting library
library(dplyr) # Everyone needs to manipulate data
library(corrplot) # Here to provide more beautiful correlation coefficient plotting
library(tidyr) # Tidyr provides data reshaping functions such as gather()

2.2 Data import

The next bit is importing the data and taking a look at it.

load(file = "/Volumes/tuti/IODS-final/data/wrangled_students.Rdata")
dim(rawdata)
## [1] 382  31

The data has 382 observations of 31 variables. Let’s next look at the variables.

str(rawdata)
## 'data.frame':    382 obs. of  31 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : Factor w/ 2 levels "High","Low": 1 2 2 1 1 1 2 1 1 1 ...
##  $ Fedu      : Factor w/ 2 levels "High","Low": 1 2 2 2 1 1 2 1 2 1 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: Factor w/ 4 levels "1","2","3","4": 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 2 2 2 4 4 2 1 ...
##  $ health    : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ G3_quart  : Factor w/ 4 levels "Q1","Q2","Q3",..: 1 1 2 4 2 4 2 1 4 4 ...

Further information on all of the original variables, as well as the origin of the data, is available from the source.

A short recap of the variables in the wrangled dataset:

  • school: student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  • sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • age: student’s age (numeric: from 15 to 22)
  • address: student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  • famsize: family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  • Pstatus: parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  • Medu: mother’s education (binary: “high” = secondary or higher, “low” = lower than secondary)
  • Fedu: father’s education (binary: “high” = secondary or higher, “low” = lower than secondary)
  • Mjob: mother’s job (factor: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • Fjob: father’s job (factor: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • reason: reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  • nursery: attended nursery school (binary: yes or no)
  • internet: Internet access at home (binary: yes or no)
  • guardian: student’s guardian (factor: ‘mother’, ‘father’ or ‘other’)
  • traveltime: home to school travel time (factor: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • studytime: weekly study time (factor: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • failures: past class failures (binary: true or false)
  • schoolsup: extra educational support (binary: yes or no)
  • famsup: family educational support (binary: yes or no)
  • paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • activities: extra-curricular activities (binary: yes or no)
  • higher: wants to take higher education (binary: yes or no)
  • romantic: with a romantic relationship (binary: yes or no)
  • famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • freetime: free time after school (numeric: from 1 - very low to 5 - very high)
  • goout: going out with friends (numeric: from 1 - very low to 5 - very high)
  • health: current health status (numeric: from 1 - very bad to 5 - very good)
  • absences: the number of the student’s absences (numeric)
  • G3: final grade (numeric: from 0 to 20)
  • high_use: does the student use a high amount of alcohol (binary: true or false)
  • G3_quart: the quartile of the student’s final grade, based on G3

Descriptions for unmodified variables are from the source, adapted mutatis mutandis for the factorised variables. Variables created during data wrangling are described by the author.

2.3 Exploring the variables

Finally, let’s see how the observations are distributed in the data. We’ll use both the classical summary function and a graphical overview. As we’ve got lots of variables, we also define a larger width and height for the graphics so they display nicely.

It pays to take a graphical look at the variables. People make sense of data containing multiple variables in different ways, but a graphical representation usually works at least for spotting variables so skewed that they would also skew any further analysis.

summary(rawdata)
##  school   sex          age        address famsize   Pstatus   Medu    
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38   High:230  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344   Low :152  
##                   Median :17.00                                       
##                   Mean   :16.59                                       
##                   3rd Qu.:17.00                                       
##                   Max.   :22.00                                       
##    Fedu           Mjob           Fjob            reason    nursery  
##  High:198   at_home : 53   at_home : 16   course    :140   no : 72  
##  Low :184   health  : 33   health  : 17   home      :110   yes:310  
##             other   :138   other   :211   other     : 34            
##             services: 96   services:107   reputation: 98            
##             teacher : 62   teacher : 31                             
##                                                                     
##  internet    guardian   traveltime studytime  failures       schoolsup
##  no : 58   father: 91   1:250      1:103     Mode :logical   no :331  
##  yes:324   mother:275   2:103      2:190     FALSE:316       yes: 51  
##            other : 16   3: 21      3: 62     TRUE :66                 
##                         4:  8      4: 27                              
##                                                                       
##                                                                       
##  famsup     paid     activities higher    romantic  famrel  freetime
##  no :144   no :205   no :181    no : 18   no :261   1:  9   1: 18   
##  yes:238   yes:177   yes:201    yes:364   yes:121   2: 18   2: 62   
##                                                     3: 66   3:156   
##                                                     4:183   4:109   
##                                                     5:106   5: 37   
##                                                                     
##  goout   health     absences            G3         high_use       G3_quart
##  1: 24   1: 46   Min.   : 0.000   Min.   : 0.00   Mode :logical   Q1:100  
##  2: 99   2: 43   1st Qu.: 0.000   1st Qu.: 8.00   FALSE:270       Q2:126  
##  3:123   3: 83   Median : 3.000   Median :11.00   TRUE :112       Q3: 82  
##  4: 82   4: 64   Mean   : 5.319   Mean   :10.39                   Q4: 74  
##  5: 54   5:146   3rd Qu.: 8.000   3rd Qu.:14.00                           
##                  Max.   :75.000   Max.   :20.00
gather(rawdata) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

From these overviews we can deduce the following points that are pertinent for further analysis:

  • Only a small number of students have guardians other than their parents. On the other hand, co-guardianship has not been measured separately.
  • Only a small number of fathers either stay at home or work in healthcare.
  • When the mothers’ and fathers’ employment are taken together, the share of parents employed as teachers is ~12%, well above the OECD average of 3.94% and even the national front runner Iceland at 7.8% (data from 1999). Further exploration of the participating schools would hence be warranted to gauge how representative they are of Portuguese students’ backgrounds.
  • A negligible number of students do not strive for higher education.
  • The number of students without Internet access at home is small, but non-negligible.
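As a sanity check, the teacher share quoted above can be recomputed directly from the counts in the summary output; a minimal sketch, where 62 and 31 are the Mjob and Fjob teacher counts from summary(rawdata):

```r
# Share of parents employed as teachers, using counts from summary(rawdata)
teacher_parents <- 62 + 31   # mothers + fathers with job "teacher"
all_parents     <- 2 * 382   # two parents per student
round(100 * teacher_parents / all_parents, 1)
## [1] 12.2
```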

3 Hypotheses

The null hypothesis is that no significant categories can be established in the data. The alternative hypothesis is that the parents’ education levels are associated with the highest quartile of the final grade.


4 Multiple Correspondence Analysis

Having now acquainted ourselves with the contents of the dataset, it’s time to see whether there are any interesting structures hidden in the data.

We will use the quantitative (integer) variables (indices 3, 28 and 29, i.e. age, absences and G3) as supplementary variables in the MCA, and base the actual categorisation on the categorical variables. As before, we will mostly use wider and taller figures, as the amount of information we wish to display is quite large.

4.1 MCA Summaries

Our first task is to look at the summary of the MCA to see whether a low number of dimensions accounts for a high percentage of the variance, which would indicate a significant finding in the analysis. In addition to the textual summary we will add a scree plot of the eigenvalues to look for a drop-off in the dimensions’ contributions to the variance.

mca <- MCA(rawdata, ncp = 5, quanti.sup=c(3,28,29), graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = rawdata, ncp = 5, quanti.sup = c(3, 28, 29), graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.109   0.086   0.065   0.063   0.061   0.060
## % of var.              5.557   4.390   3.319   3.206   3.124   3.066
## Cumulative % of var.   5.557   9.947  13.266  16.472  19.596  22.662
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.058   0.055   0.054   0.052   0.052   0.049
## % of var.              2.932   2.807   2.763   2.658   2.625   2.517
## Cumulative % of var.  25.594  28.401  31.165  33.823  36.448  38.965
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.048   0.045   0.044   0.043   0.042   0.041
## % of var.              2.432   2.298   2.249   2.199   2.145   2.096
## Cumulative % of var.  41.397  43.695  45.944  48.142  50.287  52.383
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.040   0.039   0.038   0.038   0.037   0.036
## % of var.              2.046   2.004   1.934   1.916   1.894   1.818
## Cumulative % of var.  54.429  56.433  58.366  60.283  62.177  63.995
##                       Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
## Variance               0.035   0.034   0.033   0.033   0.031   0.031
## % of var.              1.761   1.722   1.681   1.669   1.597   1.578
## Cumulative % of var.  65.756  67.478  69.159  70.827  72.424  74.002
##                       Dim.31  Dim.32  Dim.33  Dim.34  Dim.35  Dim.36
## Variance               0.030   0.029   0.028   0.027   0.026   0.026
## % of var.              1.526   1.479   1.433   1.398   1.343   1.329
## Cumulative % of var.  75.527  77.006  78.438  79.837  81.180  82.508
##                       Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.025   0.024   0.023   0.023   0.022   0.021
## % of var.              1.271   1.203   1.194   1.172   1.120   1.067
## Cumulative % of var.  83.780  84.982  86.176  87.348  88.468  89.535
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48
## Variance               0.020   0.020   0.019   0.019   0.018   0.016
## % of var.              1.037   1.034   0.970   0.949   0.926   0.836
## Cumulative % of var.  90.572  91.606  92.575  93.525  94.451  95.287
##                       Dim.49  Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.015   0.015   0.015   0.014   0.013   0.012
## % of var.              0.774   0.762   0.757   0.697   0.652   0.609
## Cumulative % of var.  96.061  96.823  97.581  98.278  98.930  99.539
##                       Dim.55
## Variance               0.009
## % of var.              0.461
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##             Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1        |  0.070  0.012  0.002 | -0.258  0.203  0.029 | -0.217  0.189
## 2        |  0.288  0.199  0.060 | -0.388  0.458  0.109 | -0.002  0.000
## 3        |  0.357  0.306  0.064 | -0.303  0.278  0.046 | -0.301  0.363
## 4        | -0.425  0.432  0.101 | -0.130  0.051  0.009 |  0.229  0.210
## 5        | -0.147  0.052  0.019 | -0.291  0.257  0.073 | -0.138  0.077
## 6        | -0.467  0.523  0.188 |  0.156  0.074  0.021 |  0.106  0.045
## 7        |  0.180  0.077  0.028 | -0.068  0.014  0.004 | -0.157  0.099
## 8        | -0.057  0.008  0.001 | -0.294  0.262  0.029 | -0.448  0.806
## 9        | -0.350  0.293  0.071 | -0.025  0.002  0.000 |  0.044  0.008
## 10       | -0.238  0.136  0.034 |  0.234  0.166  0.032 |  0.018  0.001
##            cos2  
## 1         0.020 |
## 2         0.000 |
## 3         0.046 |
## 4         0.029 |
## 5         0.017 |
## 6         0.010 |
## 7         0.022 |
## 8         0.068 |
## 9         0.001 |
## 10        0.000 |
## 
## Categories (the 10 first)
##              Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## GP       |  -0.104   0.318   0.093  -5.945 |   0.003   0.000   0.000
## MS       |   0.891   2.718   0.093   5.945 |  -0.023   0.002   0.000
## F        |  -0.051   0.044   0.003  -1.031 |  -0.578   7.183   0.360
## M        |   0.055   0.047   0.003   1.031 |   0.622   7.730   0.360
## R        |   0.685   3.258   0.126   6.939 |  -0.160   0.224   0.007
## U        |  -0.184   0.877   0.126  -6.939 |   0.043   0.060   0.007
## GT3      |   0.013   0.004   0.000   0.415 |  -0.062   0.116   0.010
## LE3      |  -0.035   0.011   0.000  -0.415 |   0.166   0.310   0.010
## A        |  -0.331   0.356   0.012  -2.147 |   0.101   0.042   0.001
## T        |   0.037   0.039   0.012   2.147 |  -0.011   0.005   0.001
##           v.test     Dim.3     ctr    cos2  v.test  
## GP         0.156 |  -0.107   0.561   0.098  -6.103 |
## MS        -0.156 |   0.914   4.795   0.098   6.103 |
## F        -11.712 |  -0.207   1.212   0.046  -4.183 |
## M         11.712 |   0.222   1.304   0.046   4.183 |
## R         -1.616 |   0.519   3.131   0.073   5.257 |
## U          1.616 |  -0.140   0.842   0.073  -5.257 |
## GT3       -1.979 |   0.015   0.008   0.001   0.463 |
## LE3        1.979 |  -0.039   0.022   0.001  -0.463 |
## A          0.652 |  -0.477   1.238   0.025  -3.092 |
## T         -0.652 |   0.053   0.137   0.025   3.092 |
## 
## Categorical variables (eta2)
##            Dim.1 Dim.2 Dim.3  
## school   | 0.093 0.000 0.098 |
## sex      | 0.003 0.360 0.046 |
## address  | 0.126 0.007 0.073 |
## famsize  | 0.000 0.010 0.001 |
## Pstatus  | 0.012 0.001 0.025 |
## Medu     | 0.406 0.119 0.000 |
## Fedu     | 0.351 0.056 0.000 |
## Mjob     | 0.361 0.198 0.033 |
## Fjob     | 0.124 0.088 0.061 |
## reason   | 0.064 0.060 0.073 |
## 
## Supplementary continuous variables
##             Dim.1    Dim.2    Dim.3  
## age      |  0.268 |  0.046 | -0.018 |
## absences | -0.013 |  0.082 | -0.168 |
## G3       | -0.438 |  0.026 |  0.399 |
fviz_eig(mca, ncp = 10)

The MCA summary shows that the contributions of the dimensions to the variance are not especially high, with the first dimension contributing just ~5.6% and the second dimension only ~4.4%. Therefore the explanatory potential of the model is not very substantial, but we will nevertheless use the analysis to gauge what little we can extract from the modest contribution of the first two dimensions to the variance.

From the scree plot of the eigenvalues we can clearly see that the contribution to variance evens out after the first two dimensions. Hence limiting further analysis to two dimensions is justifiable.
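The same cut-off can also be read off the eigenvalue table directly, without the plot; the mca$eig component returned by FactoMineR holds the eigenvalues along with the raw and cumulative percentages of variance:

```r
# Eigenvalue, % of variance and cumulative % of variance per dimension
head(mca$eig, 3)
```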

We will next look at the contributions of the variables, at value level, to the first two dimensions. This already gives us a feel for the groupings we may expect when we later look at the dimensions and the observations in a more graphical fashion.

4.2 Contributions to dimensions

vars <- get_mca_var(mca)
head(vars$cos2, 8)
##            Dim 1        Dim 2        Dim 3      Dim 4       Dim 5
## GP  0.0927770952 6.348081e-05 0.0977595148 0.02426203 0.130303865
## MS  0.0927770952 6.348081e-05 0.0977595148 0.02426203 0.130303865
## F   0.0027895829 3.600425e-01 0.0459181801 0.01245582 0.031258784
## M   0.0027895829 3.600425e-01 0.0459181801 0.01245582 0.031258784
## R   0.1263757621 6.850096e-03 0.0725224026 0.03964565 0.120288337
## U   0.1263757621 6.850096e-03 0.0725224026 0.03964565 0.120288337
## GT3 0.0004516079 1.027673e-02 0.0005625846 0.05545224 0.001836966
## LE3 0.0004516079 1.027673e-02 0.0005625846 0.05545224 0.001836966
fviz_contrib(mca, choice = "var", axes = 1, top = 28)

In the contributions to the first dimension, the parental education variables dominate, together with failures.

fviz_contrib(mca, choice = "var", axes = 2, top = 28)

In the contributions to the second dimension, sex accounts for much of the variance.

4.3 Variables in dimensions

corrplot(vars$cos2, is.corr=FALSE)

fviz_mca_var(mca, choice = "mca.cor", 
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())

4.4 Dimensions of the quantitative supplementary variables

fviz_mca_var(mca, choice = "quanti.sup",
             ggtheme = theme_minimal())

As we can see, age increases along dimension 1 while the final grade decreases, making the two nearly opposite on that dimension; absences contribute little to either of the first two dimensions.

4.5 MCA biplot

fviz_mca_biplot(mca, repel = TRUE, ggtheme = theme_minimal())

4.6 Variable categories in dimensions

fviz_mca_var(mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )

fviz_mca_ind(mca, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping (slow if many points)
             ggtheme = theme_minimal())

While the dimensions explain only a small share of the total variance, some of the values of the variables are grouped relatively closely together. For example, a high amount of free time seems to go hand in hand with going out a lot, and both the mother’s and the father’s employment as teachers seem to coincide. Similarly, a low amount of study time lies close to high use of alcohol.

If we select values whose coordinates exceed 0.5 in absolute value on both dimensions, we end up with the following groupings in the four quadrants:

Top-left:

  • Father’s job is a teacher
  • Mother’s job is a teacher
  • Mother’s job is in healthcare
  • Mother has at least a secondary education
  • Student’s final grade is in the highest quartile

Top-right

  • Student doesn’t strive for higher education
  • Travel time is relatively high (4)
  • Study time is very low (1)
  • Student has failed courses
  • Student consumes a high amount of alcohol

Bottom-right

  • Mother has not completed secondary education
  • Student has no home Internet connection
  • Mother stays home
  • Student has a very low amount of free time (1)

Bottom-left

  • Father works in healthcare

While the dimensions explain a low portion of the variance, the groupings can tentatively be used, e.g., for identifying students who would benefit from supportive measures in their education. We can also use other forms of analysis to check whether some of the variables correlate with study results (as we do in a later section).
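The quadrant groupings above were read off the plot by eye; as a sketch, the same selection can be made programmatically from the category coordinates in the vars object we created earlier with get_mca_var(mca):

```r
# Category points lying beyond 0.5 in absolute value on both of the first
# two dimensions, i.e. the corners of the factor map
coords <- as.data.frame(vars$coord[, 1:2])
names(coords) <- c("Dim1", "Dim2")
subset(coords, abs(Dim1) > 0.5 & abs(Dim2) > 0.5)
```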

4.7 Concentration ellipses

One quite intuitive, graphical way to spot whether our variables actually have explanatory power within their dimensions is to plot the variables in the two-dimensional space with concentration ellipses.

fviz_ellipses(mca, addEllipses=TRUE, "failures", geom = "point")

fviz_ellipses(mca, "Medu", geom = "point")

fviz_ellipses(mca, "Fedu", geom = "point")

fviz_ellipses(mca, "Mjob", geom = "point")

fviz_ellipses(mca, "Fjob", geom = "point")

fviz_ellipses(mca, "high_use", geom = "point")

It is obvious from how few individuals fall within most of the ellipses that the variables are not very representative by themselves in the two-dimensional space.

To wrap things up, we will resort to a factor map of the values of variables.

plot(mca, habillage = "quali", invisible=c("ind"))

4.8 Summary of the MCA

As is abundantly evident by now, the Multiple Correspondence Analysis failed to explain more than 10% of the variance with the first two, easily graphically representable, dimensions. As we saw in the MCA summary at the early stage of our analysis, the cumulative contributions to the variance did not cross the 50% mark until the 17th dimension. Hence the original and constructed factor variables cannot be categorised in a statistically significant manner.

4.9 Hypothesis

As we did not find any significant categories within the data, we cannot reject the null hypothesis.


5 Extra analysis with other methods

Why stop with MCA when you’re having fun? While exploring the variables before the main analysis, some unanswered questions about the connections between the variables cropped up. Let’s see if we can shed some light on the students’ families.

plot(rawdata$Mjob~rawdata$Fjob, xlab="Father's occupation", ylab="Mother's occupation", main="Occupational homogeneity", cex = 0.5)

plot(rawdata$Medu~rawdata$Fedu, xlab="Father's education level", ylab="Mother's education level", main="Educational homogeneity", cex = 0.5) 

From these standard plots we can deduce that the parents’ education levels correlate quite heavily. We can try to verify this through logistic regression.

summary(glm(Fedu ~ Medu, data = rawdata, family = binomial))
## 
## Call:
## glm(formula = Fedu ~ Medu, family = binomial, data = rawdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8793  -0.7623  -0.7623   0.6125   1.6599  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0871     0.1518  -7.159 8.11e-13 ***
## MeduLow       2.6652     0.2635  10.113  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 529.05  on 381  degrees of freedom
## Residual deviance: 398.86  on 380  degrees of freedom
## AIC: 402.86
## 
## Number of Fisher Scoring iterations: 4

From the summary we can see that a low education level for the mother strongly predicts a low education level for the father; the two correlate quite heavily.
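The same association can also be gauged without a regression model; a sketch using a plain chi-squared test of independence on the cross-tabulation of the two education variables:

```r
# Cross-tabulate the parents' education levels and test for independence
edu_tab <- table(rawdata$Medu, rawdata$Fedu)
edu_tab
chisq.test(edu_tab)
```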

We would also like to find out whether the father’s or mother’s employment in a certain sector predicts the student’s final grade. For this we examine a graphical representation of the final grades for each sector and then fit a linear regression with the final grade as the dependent variable.

qplot(G3, data = rawdata, facets = Fjob~., geom = "freqpoly", binwidth = 1, xlab = "Final grade", ylab = "Students", main="Father's job and final grade")

summary(lm(G3 ~ Fjob, data = rawdata))
## 
## Call:
## lm(formula = G3 ~ Fjob, data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7097  -2.1896   0.6916   3.2802   9.6916 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.7500     1.1715   8.322  1.6e-15 ***
## Fjobhealth     1.7794     1.6323   1.090    0.276    
## Fjobother      0.4396     1.2151   0.362    0.718    
## Fjobservices   0.5584     1.2561   0.445    0.657    
## Fjobteacher    1.9597     1.4425   1.359    0.175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.686 on 377 degrees of freedom
## Multiple R-squared:  0.01097,    Adjusted R-squared:  0.0004729 
## F-statistic: 1.045 on 4 and 377 DF,  p-value: 0.3837
qplot(G3, data = rawdata, facets = Mjob~., geom = "freqpoly", binwidth = 1, xlab = "Final grade", ylab = "Students", main="Mother's job and final grade")

summary(lm(G3 ~ Mjob, data = rawdata))
## 
## Call:
## lm(formula = G3 ~ Mjob, data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0606  -1.7609   0.2419   3.2391  10.1509 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.8491     0.6330  13.980  < 2e-16 ***
## Mjobhealth     3.2115     1.0219   3.143  0.00181 ** 
## Mjobother      0.9118     0.7447   1.224  0.22156    
## Mjobservices   2.4739     0.7886   3.137  0.00184 ** 
## Mjobteacher    1.9090     0.8621   2.214  0.02740 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.608 on 377 degrees of freedom
## Multiple R-squared:  0.04355,    Adjusted R-squared:  0.0334 
## F-statistic: 4.291 on 4 and 377 DF,  p-value: 0.002086
qplot(G3, data = rawdata, facets = Medu~Fedu, geom = "freqpoly", binwidth = 1, xlab = "Final grade", ylab = "Students")

summary(lm(G3 ~ Medu + Fedu, data = rawdata))
## 
## Call:
## lm(formula = G3 ~ Medu + Fedu, data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1489  -2.1489   0.7094   3.1140   9.7094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.1489     0.3371  33.068   <2e-16 ***
## MeduLow      -1.5955     0.5852  -2.726   0.0067 ** 
## FeduLow      -0.2629     0.5733  -0.459   0.6468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.619 on 379 degrees of freedom
## Multiple R-squared:  0.03391,    Adjusted R-squared:  0.02881 
## F-statistic: 6.651 on 2 and 379 DF,  p-value: 0.001449

Finally, we’ll check graphically whether there seems to be any correlation between the number of absences and the students’ final grade quartiles.

qplot(absences, data = rawdata, facets = G3_quart~., geom = "freqpoly")

As most students have few or no absences regardless of their quartile, it’s hard to determine any substantial correlation between absences and the final grade from the graphical representation. The fourth-quartile panel suggests, though, that the students with the highest final grades did not have a substantial number of absences.
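For a numeric counterpart to the graphical check, one option would be a rank correlation between the absences and the raw final grade; a sketch (Spearman’s rho; the many ties will make R warn that the p-value is approximate):

```r
# Rank correlation between the number of absences and the final grade
cor.test(rawdata$absences, rawdata$G3, method = "spearman")
```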